Abstract

The purpose of the current study was to investigate the effectiveness of four methods of 
handling missing data. Effectiveness was defined as the probability of accurately reproducing the 
covariance matrix of the target sample. Effectiveness of the missing data methods was assessed 
by manipulating the proportion of cases containing missing values and the sample size. This study 
replicated Witta’s (2000) study with an addition to the sampling procedure. The pattern of 
missing values was used to create missing values in complete cases and compared to the case 
replacement method. Thus, the current study was also testing the effects of the case replacement 
sampling procedure used by Witta and the variable value replacement method used in this study. 
Results indicate there are differences in decisions concerning effectiveness of a missing data 
method used due to the way the sample was created and the proportion of missing values used.

Does Method of Creating the Sample Influence
Missing Data Decisions?

When data are analyzed in survey research, often there are missing values. If the 
mechanism causing the missing values is known, the solution to this problem may be incorporated 
in the study. Inevitably, however, when data are collected by survey, subjects may fail to answer
some questions for reasons unknown to the researcher. Ignoring this problem may lead to 
analysis of data that is of dubious value. 

In addition, different methods of handling missing values may produce different results. 
When Jackson (1968) entered data on all the available variables in a discriminant analysis, the 
significance of the regression coefficients of individual variables, as well as the interpretation of 
the importance of these variables, changed with the missing value method used. Witta and Kaiser 
(1991) also reported that the regression coefficients and total variance accounted for by the 
variables changed depending on the method used to handle missing values. After re-analyzing 
three studies of private/public school achievement, Ward and Clark III (1991) concluded that the 
method used to handle missing data influenced the outcome of these studies. 

When using the National Educational Longitudinal Study of 1988 database to investigate
the effects of part-time work on school outcomes, Singh and Ozturk (1999) eliminated more than
half of the selected cases by listwise deletion of the incomplete data. The question then became
whether listwise deletion was an appropriate method for handling the missing data or whether
another method would be more effective. Witta (2000) investigated this question by stratifying the
population into complete and incomplete cases and randomly selecting samples from each population
to create data sets with varying proportions of incomplete cases. This method of sample selection
was suspect because the cases were then tested against the complete sample.

Statement of the Problem 

The purpose of the current study was to determine if there was a difference in results if the 
incomplete cases were randomly selected and replaced complete ones or if values in complete 
cases were deleted (missing) based on the pattern of missing values in the incomplete sample. The 
effectiveness of four methods of handling missing data using the 26 variables in the Singh and 
Ozturk study was assessed under both sample creation conditions. Effectiveness was defined as 
the probability of accurately reproducing the true covariance matrix. Effectiveness of the missing 
data methods was assessed by manipulating the proportion of cases containing missing values and 
the sample size. The missing data methods studied were listwise deletion, pairwise deletion,
regression, and expectation maximization. Sample sizes investigated were 500, 1000, and 2000.
The proportions of incomplete cases in each sample were 30%, 50%, and 70%.

Until recently, the only methods available with popular statistical computer software 
focused on handling the missing data problem by deleting subjects with incomplete information, 
deleting the variables with missing values, or replacing the missing value with some reasonable 
estimate. Now, however, new subroutines are available to provide more assistance in handling 
missing data and providing analysis choices using iterative regression or expectation maximization 
(EM) procedures. These relatively new methods (in current software) also provide the possibility 
of specifying the model to be used (i.e., multivariate normality, adding a randomly selected error). 

Methods Studied 

Listwise Deletion 



Listwise deletion is probably the most frequently used method of handling missing data
and is available as a default option in several statistical software programs, including SPSS. This
method discards cases with a missing value on any variable and thus is very wasteful of data.
Listwise deletion, however, has been shown to be effective with low average intercorrelation, fewer
than four variables, and a small proportion of missing values (Chan et al., 1976; Haitovsky, 1968;
Timm, 1970). The assumption of missing completely at random is crucial to the use of this
method. It is more likely, however, to find the complete sample different in important ways from
the incomplete sample (Little & Rubin, 1987). Problems for a researcher using this method
include a reduction in power and an increase in standard error due to reduced sample size and the
possible elimination of sub-populations.
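
The following is a minimal sketch of listwise deletion, assuming missing values are coded as None and using a small made-up data set; the function name listwise_cov is hypothetical and is not part of SPSS or of the procedure used in this study. It shows how every case with any missing value drops out before the covariance matrix is computed.

```python
from statistics import mean

def listwise_cov(rows):
    """Covariance matrix computed from complete cases only (None = missing)."""
    complete = [r for r in rows if None not in r]
    n, p = len(complete), len(complete[0])
    means = [mean(r[j] for r in complete) for j in range(p)]
    cov = [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in complete) / (n - 1)
            for k in range(p)]
           for j in range(p)]
    return cov, n

# Illustrative data: two of the five cases have a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 5.0),
    (5.0, 10.0, 4.0),
]
cov, n_used = listwise_cov(rows)
print(n_used)  # number of cases that survive deletion
```

In this toy example only three of five cases are retained, which illustrates the loss of sample size described above.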

Pairwise Deletion 

When using pairwise deletion, covariances are computed between all pairs of variables 
having both observations, eliminating those that have a missing value for one of the two variables 
(Glasser, 1964). Means and variances are computed on all available observations. The 
assumption made is that the use of the maximum number of pairs and all the individual 
observations yield more valid estimates of the relationship between the variables. It is assumed 
that when two variables are correlated, information on one improves the estimates of the other 
variable. It is also assumed that the pairs are a random subset of the sample pairs. If these 
assumptions are true, pairwise deletion produces unbiased estimates of the variable means and 
variances (Hertel, 1976). When missing data are not missing completely at random, however, the 
correlation matrix produced by pairwise deletion may not be Gramian (Norusis, 1988). 
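
As a rough sketch of the idea, assuming None marks a missing value, the function below computes one covariance from the cases that have both variables observed. The name pairwise_cov and the data are illustrative only, and the sketch takes the means from the pair-complete cases, which is one of several pairwise variants and may differ from the SPSS implementation.

```python
def pairwise_cov(rows, j, k):
    """Covariance of variables j and k over cases where both are observed."""
    pairs = [(r[j], r[k]) for r in rows
             if r[j] is not None and r[k] is not None]
    n = len(pairs)
    mean_j = sum(a for a, _ in pairs) / n
    mean_k = sum(b for _, b in pairs) / n
    cov = sum((a - mean_j) * (b - mean_k) for a, b in pairs) / (n - 1)
    return cov, n

# Illustrative data: two of the five cases have a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 5.0),
    (5.0, 10.0, 4.0),
]
cov01, n01 = pairwise_cov(rows, 0, 1)   # uses 4 cases
cov02, n02 = pairwise_cov(rows, 0, 2)   # uses a different set of 4 cases
```

Because each element of the matrix may rest on a different subset of cases, the assembled matrix is not guaranteed to be Gramian.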

Marsh (1998) investigated the estimates produced when using pairwise deletion for
randomly missing data. From this study, which included five levels of missing data and three
sample sizes, Marsh concluded parameter variability was explained, parameter estimates were
unbiased, and only one covariance matrix was nonpositive definite.

Regression 

Regression as an imputation method has many variations. The regression methods rely on 
information contained in non-missing values of other variables to provide estimates of missing 
values. In this procedure variables containing a missing value in one or more cases are regressed 
on the other variables. The resulting regression equation is used to provide an estimate which 
replaces the missing value. Then the process is repeated until estimates do not change. 
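
A simplified sketch of the imputation step follows. It assumes a single predictor, fits the regression to the cases where the outcome is observed, and makes one pass only; the iteration described above is omitted, and the function name regression_impute is hypothetical.

```python
def regression_impute(xs, ys):
    """Fill missing ys (None) with predictions from a simple regression of
    y on x fitted to the cases where y is observed."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(observed)
    mean_x = sum(x for x, _ in observed) / n
    mean_y = sum(y for _, y in observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observed)
    sxx = sum((x - mean_x) ** 2 for x, _ in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y if y is not None else intercept + slope * x
            for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, None, 8.0, None]
filled = regression_impute(xs, ys)
```

An iterative version would re-estimate the regression from the filled data and repeat until the imputed values stop changing.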

As the average intercorrelation and the number of variables from which these methods can 
obtain information increases, the regression methods, theoretically, perform better. Too many 
variables, however, can cause problems with over prediction (Kaiser & Tracy, 1988) and too high 
an average intercorrelation can result in a singular matrix. In these cases, regression does not 
perform well. 

Variations in the regression methods include differences in methods of developing the 
initial correlation matrix (listwise deletion, pairwise deletion, and mean substitution) and the 
presence or absence of iteration procedures. Differences in regression methods also include the 
use of randomly selected residuals for iterations and assumptions of a normal distribution. 
Theoretically, the more variables considered that provide additional information, the better the 
estimate. Mundfrom and Whitcomb (1998) investigated the effects of using mean substitution,
hot-deck imputation, and regression imputation on classification of cardiac patients. Mean
substitution and hot-deck imputation correctly classified patients more frequently than regression
imputation.

Expectation Maximization

Dempster, Laird, and Rubin (1977) recommended the use of the EM algorithm which 
imputes estimates simultaneously in an iterative procedure. The E step of this algorithm finds the 
conditional expectation of the missing values. The M step performs maximum likelihood 
estimation as if there were no missing data. The primary difference between this procedure and 
the regression procedure is that the values for the missing data are not imputed and then iterated. 
The missing values are functions based on the conditional expectation (Little & Rubin, 1987). 

This method of handling missing data represents a fundamental shift in the way of thinking about
missing data (Schafer & Olsen, 1998).
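
To make the E and M steps concrete, here is a small sketch for the simplest case: two variables, x fully observed and y missing (None) for some cases, under an assumed bivariate normal model. The function name em_bivariate, the starting values, and the fixed iteration count are illustrative choices, not the SPSS routine used in this study.

```python
def em_bivariate(xs, ys, iterations=500):
    """EM estimates of the means and covariance matrix of (x, y) when x is
    fully observed and some y values are missing (None)."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n   # ML (divide-by-n) variance
    obs_y = [y for y in ys if y is not None]
    mu_y = sum(obs_y) / len(obs_y)                # starting values
    s_yy = sum((y - mu_y) ** 2 for y in obs_y) / len(obs_y)
    s_xy = 0.0
    for _ in range(iterations):
        beta = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        # E step: expected sufficient statistics given the current estimates
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + beta * (x - mu_x)     # conditional expectation of y
                eyy = ey ** 2 + resid_var         # conditional expectation of y^2
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M step: maximum likelihood estimates as if the data were complete
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return (mu_x, mu_y), ((s_xx, s_xy), (s_xy, s_yy))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, None, None, None]
means, cov = em_bivariate(xs, ys)
```

Note that the E step carries the expected squares and cross-products forward rather than treating filled-in values as real data, which is the distinction from regression imputation drawn above.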

Pattern of Missing Values 

All of the missing data handling procedures discussed require data missing at random 
(MAR) or missing completely at random (MCAR). Yet Cohen and Cohen (1983) suggested that 
in survey research the absence of data on one variable may be related to another variable and may 
be due to the value of the variable itself. When investigating simultaneously missing values, Witta 
(1996/97) found concurrently missing values (p < .001) in three of four samples using data from a
national database. 

Missing Completely at Random refers to no relationship between the missing value for one 
variable and missing values for other variables or between the missing value and values for other 
variables. The missing data subroutine in SPSS uses Little’s test to evaluate this relationship. 
Because Missing at Random refers to the missingness due to the actual value of the variable (i.e., 
if salary is too high the respondent may refuse to answer), this procedure cannot be tested. 

Schafer and Olsen (1998), however, argue convincingly that “every missing-data method must
make some largely untestable statistical assumptions about the manner in which the missing values
were lost” (p. 551). Consequently, when analyzing real data, researchers typically assume missing
at random.

Procedure 

All high school seniors who had reported working during their senior year of high school 
and for whom base-year and first follow-up data were available were included in this study. The 
initial sample contained the 26 variables used in the Singh and Ozturk study for 4664 subjects. 
These subjects were split into three populations: those containing one or more missing values but 
less than 14 (n=1542), those containing more than 13 missing values (n=19), and those containing 
no missing values on any variable (n=3103). The 19 subjects having missing values for more than 
half the variables were eliminated from further analysis. The remaining two populations (n=4645) 
were used to create samples for analysis. 

Creating Test Samples 

A sample consisting of 2000 cases was randomly selected from the non-missing 
population. This sample was duplicated twice resulting in three identical samples of 2000 cases 
containing no missing values. The first sample was used to provide estimates of the target (true) 
covariance matrix. 

A sample of 1400 cases was randomly selected from the missing population. The pattern of
missing values was recorded. This pattern was used to create missing values in 1400 of the
complete cases in the second target sample. In addition, these cases were used to replace an equal
number of randomly selected cases in the third target sample. This provided two test samples of
2000 with 70% of the cases containing missing values (one containing replacement cases and one
having missing values created) and one target sample. This process was repeated to provide test
samples with 50% (1000) of the cases containing missing values. The process was repeated again
to provide test samples with 30% (600) of the cases containing missing values. This resulted in
9 samples of 2000 cases: 3 complete target samples and 2 test samples for each proportion of
incomplete cases (30%, 50%, and 70%).

This entire procedure was repeated twice to provide test samples with 30%, 50%, and 
70% of the cases containing missing values in test samples of 1000 and 500 cases. Thus, 27 
samples were created; 9 test samples with replacement of the complete cases with incomplete 
ones, 9 test samples with missing values based on the pattern of the incomplete cases, and 9 
complete target samples. 
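
The two sample creation conditions can be sketched as follows. This is an illustration under stated assumptions: missing values are coded as None, the helper names (pattern_of, delete_by_pattern, replace_cases) are hypothetical, and the choice of which complete cases to alter is made at random, which the text does not specify for the pattern condition.

```python
import random

def pattern_of(case):
    """Tuple of booleans, True where the value is missing (None)."""
    return tuple(v is None for v in case)

def delete_by_pattern(complete, incomplete, rng):
    """Variable value replacement: keep each complete case's own values and
    blank out the cells that are missing in a sampled incomplete case."""
    out = [list(c) for c in complete]
    targets = rng.sample(range(len(out)), len(incomplete))
    for idx, donor in zip(targets, incomplete):
        for j, is_missing in enumerate(pattern_of(donor)):
            if is_missing:
                out[idx][j] = None
    return out

def replace_cases(complete, incomplete, rng):
    """Case replacement: swap whole complete cases for incomplete ones."""
    out = [list(c) for c in complete]
    targets = rng.sample(range(len(out)), len(incomplete))
    for idx, donor in zip(targets, incomplete):
        out[idx] = list(donor)
    return out

# Made-up data: 10 complete cases and 3 incomplete cases on 3 variables.
complete = [(i, i + 0.5, i * 2) for i in range(10)]
incomplete = [(100, None, 200), (None, None, 300), (400, 500, None)]
by_pattern = delete_by_pattern(complete, incomplete, random.Random(0))
by_case = replace_cases(complete, incomplete, random.Random(0))
```

Under the first condition every observed value still comes from the target sample; under the second, the observed values of the incomplete cases enter the test sample as well.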

Analysis 

Covariance matrices for the missing data handling methods were produced by the missing
data subroutine in SPSS. The test for missing completely at random and the pattern of missing data
were also produced by this subroutine. After treatment by each missing data handling method,
multi-sample analysis in LISREL 8.3 (Jöreskog & Sörbom, 1996, chap. 9) was used to test the
equality of the covariance matrices produced by the various missing data handling methods to the
covariance matrix of the target sample.
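
The degrees of freedom for these tests are not reported. As a hedged illustration only, the sketch below assumes that a two-sample test of equal covariance matrices on p = 26 variables constrains all p(p + 1)/2 distinct variances and covariances, and it approximates the .05 critical value of chi-square with the Wilson-Hilferty formula. The function name chi2_critical is hypothetical and LISREL's own computation may differ.

```python
from statistics import NormalDist

def chi2_critical(df, alpha=0.05):
    """Wilson-Hilferty approximation to the upper-alpha chi-square critical value."""
    z = NormalDist().inv_cdf(1 - alpha)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3

p = 26                      # variables in each covariance matrix
df = p * (p + 1) // 2       # ASSUMED df: distinct variances and covariances
crit = chi2_critical(df)
print(df, round(crit, 1))
```

Under this assumption the critical value falls just under 396, which appears consistent with the significance flags in Table 4 (384.80 is not flagged, whereas 396.72 is).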

Results 

Randomness of Missing Values 

Few of the samples used in the current study contained data missing completely at
random. The frequency of simultaneously missing variables for each sample is presented in Table
1. The category of ‘Test’ consists of four simultaneously missing standardized test variables
(History, Math, Reading, and Science). The standardized test variables were also missing in
conjunction with missing values for grades. The four grade variables were also missing
simultaneously. If a variable did not contain a missing value for 10% of the sample cases (either
alone or concurrently with other variables), it was included in the ‘Other’ category. In each
sample, the majority of the cases containing missing values consisted of concurrently missing
values for standardized tests (the categories ‘Test’ and ‘Grades & Test’).
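
Tabulating which variables are missing together can be sketched as below. The variable names and data are invented for illustration, None is assumed to mark a missing value, and missing_patterns is a hypothetical helper rather than the SPSS pattern report.

```python
from collections import Counter

def missing_patterns(rows, names):
    """Count how often each set of variables is missing together."""
    counts = Counter()
    for r in rows:
        missing = tuple(name for name, v in zip(names, r) if v is None)
        if missing:
            counts[missing] += 1
    return counts

names = ("history", "math", "reading", "science", "english_grade")
rows = [
    (50, 52, 49, 51, 3.0),
    (None, None, None, None, 2.5),
    (None, None, None, None, None),
    (48, 47, 50, 46, None),
    (None, None, None, None, 3.5),
]
patterns = missing_patterns(rows, names)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

A pattern in which all four test scores are absent together corresponds to the ‘Test’ category described above.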



Insert Table 1 About Here 



Covariance Matrix Reproduction 

All four missing data methods adequately reproduced (χ², p > .05) the target sample
covariance matrix when 30% or 50% of the cases contained missing values, regardless of sample
size.¹ In addition, as depicted in Tables 2 and 3, the normed fit index (NFI) in all cases was 0.98
or higher and the root mean square error of approximation (RMSEA) was 0.0. The listwise deletion
and regression methods produced identical values for the two sampling conditions (replacement by
pattern of missing values and replacement of cases by randomly selected cases). However, the
pairwise deletion and EM algorithm methods produced much lower χ² values when the sample was
created using replacement by pattern of missing values.



¹ To prevent discrepancies in sample size comparison, the n for testing the covariance
matrices produced by listwise and pairwise deletion was entered in LISREL as the target n (i.e., if
the target sample contained 500 cases, the n entered for the listwise deletion covariance matrix
was 500).

Insert Tables 2 & 3 About Here



When complete cases were replaced with incomplete ones and 70% of the cases contained
missing values, only the covariance matrix produced by the EM algorithm passably reproduced
the target sample matrix when the sample size was 1000. Under the same conditions, when the
sample size was 500 or 2000, no method adequately reproduced the target sample covariance
matrix as measured by chi-square (χ², p < .05). When, however, the pattern of missing values was
used to create missing data in complete cases and 70% of the cases were incomplete, pairwise
deletion and the EM algorithm adequately reproduced the target sample covariance matrix under
all conditions. The normed fit index remained at an acceptable level of 0.96 or higher under all
conditions. The root mean square error of approximation also remained relatively small, as shown
in Table 4.



Insert Table 4 About Here 



Discussion and Conclusions 

When 30% or 50% of the cases in a sample were incomplete, all missing data methods
tested adequately reproduced the target sample covariance matrix regardless of sample size. There
was, however, a difference in the χ² statistic for the EM algorithm and pairwise deletion based on
sampling method. In both instances, if missing values were created in complete cases using the
pattern of the incomplete ones, χ² was smaller. When 70% of the cases were incomplete, only
pairwise deletion and the EM algorithm could adequately reproduce the target sample covariance
matrix when using the missing data pattern. This finding suggests that method of creating a
sample for testing does indeed influence the results of a missing data analysis.

When using listwise deletion or regression, the method used to create missing values did 
not change results. This, of course, was due to the fact that listwise deletion uses only complete 
cases and in the comparative samples the complete cases were identical. Although the regression 
method used in this study was iterative, imputations were not used in each iteration and were 
based on the listwise deletion matrix. Consequently, the imputations were based upon the same 
data and were identical regardless of method used to create missing values. 

Both pairwise deletion and the EM algorithm, however, used all known values. When the 
complete cases were replaced by incomplete ones, the values of the variables that were not 
missing for the incomplete case also replaced the variable values for the complete ones. When the 
missing data pattern was used to delete values from complete cases, the values that were not 
deleted did not change. Consequently, the pairwise deletion and EM algorithm estimates were 
based upon different values for each sample. 

This study was limited to a single sample at each sample size and proportion of incomplete
cases. Therefore, results may be specific to these samples. However, the results from this study
show that replacement of complete cases with incomplete ones is an inappropriate method to use
in studying missing values. The results again indicate that if the proportion of incomplete cases is
relatively small (30% or 50%), all methods used to handle missing values could adequately
reproduce the covariance matrix of the target sample. Under these conditions, the researcher may
choose a missing data handling method based upon substantive reasons. When the proportion
of incomplete cases is large, only pairwise deletion and the EM algorithm could adequately
reproduce the target covariance matrix. Under these conditions, the researcher should use one of
these methods.

Table 2

Results when 30% of the Cases were Incomplete

Missing Data  Replacement - Pattern of      Replacement - Incomplete
Method        Incomplete Cases              Cases
              χ²        RMSEA     NFI       χ²        RMSEA     NFI

n=500
Listwise      100.19    0.0       .99       100.19    0.0       .99
Pairwise      33.35     0.0       1.0       120.98    0.0       .99
EM            30.31     0.0       1.0       116.90    0.0       .99
Regression    99.66     0.0       .99       99.66     0.0       .99

n=1000
Listwise      98.58     0.0       1.0       98.58     0.0       1.0
Pairwise      31.40     0.0       1.0       147.76    0.0       .99
EM            25.82     0.0       1.0       141.77    0.0       .99
Regression    99.37     0.0       1.0       99.37     0.0       1.0

n=2000
Listwise      79.73     0.0       1.0       79.73     0.0       1.0
Pairwise      25.85     0.0       1.0       158.76    0.0       1.0
EM            23.03     0.0       1.0       150.99    0.0       1.0
Regression    81.91     0.0       1.0       81.91     0.0       1.0

Note. RMSEA = Root Mean Square Error of Approximation. NFI = Normed Fit Index. ** p<.01.
* p<.05.

Table 3

Results when 50% of the Cases were Incomplete

Missing Data  Replacement - Pattern of      Replacement - Incomplete
Method        Incomplete Cases              Cases
              χ²        RMSEA     NFI       χ²        RMSEA     NFI

n=500
Listwise      204.66    0.0       .98       204.66    0.0       .98
Pairwise      63.47     0.0       .99       230.41    0.0       .98
EM            52.24     0.0       1.0       209.12    0.0       .98
Regression    206.33    0.0       .98       206.33    0.0       .98

n=1000
Listwise      194.10    0.0       .99       194.10    0.0       .99
Pairwise      54.18     0.0       1.0       233.52    0.0       .99
EM            47.26     0.0       1.0       224.22    0.0       .99
Regression    191.33    0.0       .99       191.33    0.0       .99

n=2000
Listwise      205.20    0.0       .99       205.20    0.0       .99
Pairwise      54.50     0.0       1.0       326.51    0.0       .99
EM            43.90     0.0       1.0       313.80    0.0       .99
Regression    204.79    0.0       .99       204.76    0.0       .99

Note. RMSEA = Root Mean Square Error of Approximation. NFI = Normed Fit Index. ** p<.01.
* p<.05.

Table 4

Results when 70% of the Cases were Incomplete

Missing Data  Replacement - Pattern of      Replacement - Incomplete
Method        Incomplete Cases              Cases
              χ²        RMSEA     NFI       χ²        RMSEA     NFI

n=500
Listwise      515.79*   0.03      .96       515.79*   0.03      .96
Pairwise      114.87    0.0       .99       412.07*   0.02      .96
EM            104.86    0.0       .99       396.72*   0.01      .97
Regression    519.05*   0.03      .96       519.05*   0.03      .96

n=1000
Listwise      444.59*   0.02      .98       444.59*   0.02      .98
Pairwise      116.84    0.0       1.0       402.53*   0.01      .98
EM            82.02     0.0       1.0       384.80    0.0       .98
Regression    441.73*   0.02      .98       441.73*   0.02      .98

n=2000
Listwise      503.41*   0.02      .99       503.41*   0.02      .99
Pairwise      106.21    0.0       1.0       544.89*   0.02      .99
EM            92.26     0.0       1.0       529.91*   0.02      .99
Regression    510.22*   0.02      .99       510.22*   0.02      .99

Note. RMSEA = Root Mean Square Error of Approximation. NFI = Normed Fit Index. ** p<.01.
* p<.05.

Table A-1



Study Questions and Their Suggested Constructs

Construct                        Variable Code  Question

Part-time Work (Grades 10 & 12)  F1S85          HOW MANY HRS DOES R USUALLY WORK A WEEK
                                 F2S88          CURRENT JOB, # HRS WORKED DURING SCHL YR

Attendance (Grade 10)            F1S10A         HOW MANY TIMES WAS R LATE FOR SCHOOL
                                 F1S10B         HOW MANY TIMES DID R CUT/SKIP CLASSES
                                 F1S13          HOW MANY DAYS WAS R ABSENT FROM SCHOOL

Attendance (Grade 12)            F2S9A          HOW MANY TIMES WAS R LATE FOR SCHOOL
                                 F2S9B          HOW MANY TIMES DID R CUT/SKIP CLASSES
                                 F2S9C          HOW MANY TIMES DID R MISS SCHOOL

Participation (Grade 10)         F1S40A         OFTEN GO TO CLASS WITHOUT PENCIL/PAPER
                                 F1S40B         OFTEN GO TO CLASS WITHOUT BOOKS
                                 F1S40C         OFTEN GO TO CLASS WITHOUT HOMEWORK DONE

Participation (Grade 12)         F2S24A         GO TO CLASS WITHOUT PENCIL/PAPER
                                 F2S24B         GO TO CLASS WITHOUT BOOKS
                                 F2S24C         GO TO CLASS WITHOUT HOMEWORK DONE

Homework (Grade 10)              F1S36A1        TIME SPENT ON HOMEWORK IN SCHOOL
                                 F1S36A2        TIME SPENT ON HOMEWORK OUT OF SCHOOL

Homework (Grade 12)              F2S25F1        TOTAL TIME SPENT ON HMWRK IN SCHOOL
                                 F2S25F2        TOTAL TIME SPENT ON HMWRK OUT SCHL

Grades 12                        F2RHENG2       AVERAGE GRADE IN ENGLISH (HS+B)
                                 F2RHMAG2       AVERAGE GRADE IN MATHEMATICS (HS+B)
                                 F2RHSCG2       AVERAGE GRADE IN SCIENCE (HS+B)
                                 F2RHSOG2       AVERAGE GRADE IN SOCIAL STUDIES (HS+B)

Standardized Tests (Grade 12)    F22XHSTD       HISTORY/CIT/GEOG STANDARDIZED SCORE
                                 F22XMSTD       MATHEMATICS STANDARDIZED SCORE
                                 F22XRSTD       READING STANDARDIZED SCORE
                                 F22XSSTD       SCIENCE STANDARDIZED SCORE